Individual Assignment Part 2: R Packages, Data and Analysis

Vast Challenge 2021 - The Kronos Incident - Mini Challenge 2

Kelly Koh https://www.linkedin.com/in/kellykkw/ (School of Computing and Information Systems, Singapore Management University)https://scis.smu.edu.sg/
06-10-2021

Part 1: Background and Methodology

Part 2: R Packages, Data and Analysis

Part 3: Insights and Conclusion

4. R Packages & Data

4.1 R Packages

The following R packages are used:

hide
# The following R packages are used for respective parts of the workflow:
packages = c( # Loading data from csv files
               'readr',
              # Cleaning and manipulating data 
              'tidyverse', 'DT', 'fuzzyjoin',
              # Manipulate time and date variables
              'lubridate', 'clock', 
              # Plot heat maps, gantt, parallel coordinates and scatter plot graphs
              'ggplot2', 'igraph', 'gghighlight', 'scales', 'ggbreak', 'ggforce','gridExtra',
              # Plot interactive network graph
              'visNetwork',
              # Add interactivity to ggplot charts
              'plotly', 'gganimate', 'quantmod', 'crosstalk',
              # Load maps & manipulate geo-spatial data
              'jpeg', 'grid', 'geohashTools', 'sf', 'tmap', 'patchwork', 'raster', 'rgdal', 'mapview')

for (p in packages) {
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

4.2 Datasets

4.2.1 Credit Card Data

hide
cc <- read_csv("data/cc_data.csv",
               col_types = cols(timestamp = col_datetime(format = "%m/%d/%Y %H:%M"), 
                                location = col_character(), 
                                price = col_double(), 
                                last4ccnum = col_double()))

cc <- cc %>% arrange(last4ccnum,timestamp) %>%
  mutate(last4ccnum = as.factor(last4ccnum))

glimpse(cc)
Rows: 1,490
Columns: 4
$ timestamp  <dttm> 2014-01-06 08:16:00, 2014-01-06 13:27:00, 2014-0~
$ location   <chr> "Brew've Been Served", "Abila Zacharo", "Brew've ~
$ price      <dbl> 14.97, 50.14, 11.92, 45.05, 7.26, 50.36, 62.20, 1~
$ last4ccnum <fct> 1286, 1286, 1286, 1286, 1286, 1286, 1286, 1286, 1~

4.2.2 Loyalty Card Data

hide
loyalty <- read_csv("data/loyalty_data.csv",
               col_types = cols(timestamp = col_date(format = "%m/%d/%Y"), 
                                location = col_character(), 
                                price = col_double(), 
                                loyaltynum = col_character()))

loyalty <- loyalty %>% arrange(loyaltynum,timestamp) %>%
  mutate(loyaltynum = as.factor(loyaltynum))

glimpse(loyalty)
Rows: 1,392
Columns: 4
$ timestamp  <date> 2014-01-06, 2014-01-06, 2014-01-07, 2014-01-07, ~
$ location   <chr> "Hallowed Grounds", "Hippokampos", "Hallowed Grou~
$ price      <dbl> 12.93, 36.54, 8.42, 36.50, 20.53, 3.90, 28.25, 15~
$ loyaltynum <fct> L1107, L1107, L1107, L1107, L1107, L1107, L1107, ~

4.2.3 GPS Data

hide
gps <- read_csv("data/gps.csv",
               col_types = cols(Timestamp = col_datetime(format = "%m/%d/%Y  %H:%M:%S"), 
                                id = col_double(), 
                                lat = col_double(), 
                                long = col_double()))

gps <- gps %>% arrange(id, Timestamp) %>%
  mutate(id = as.factor(id))

glimpse(gps)
Rows: 685,169
Columns: 4
$ Timestamp <dttm> 2014-01-06 07:20:01, 2014-01-06 07:20:03, 2014-01~
$ id        <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ lat       <dbl> 36.06646, 36.06634, 36.06615, 36.06613, 36.06595, ~
$ long      <dbl> 24.88258, 24.88259, 24.88258, 24.88258, 24.88262, ~

4.2.4 Car Assignment Data

hide
cars <- read_csv("data/car-assignments.csv",
               col_types = cols(LastName = col_character(), 
                                FirstName = col_character(),
                                CarID = col_double(), 
                                CurrentEmploymentType = col_character(),
                                CurrentEmploymentTitle = col_character()))

cars <- cars %>% arrange(CarID) %>%
  mutate(CarID = as.factor(CarID))

glimpse(cars)
Rows: 44
Columns: 5
$ LastName               <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName              <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID                  <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~

Beyond the csv files, a tourist map image of Aliba (see Figure 1) and geospatial datasets are provided.

Figure 1

5. Data Transformation And Analysis

5.1 Credit and Loyalty Card Transactions

5.1.1 Correct unrecognized character in location name

Replace the unrecognized characters in Katerina’s Cafe location name in credit card & loyalty data using grep.

hide
# Match names of locations in credit card and loyalty card transactions to check spelling
cc_loc <- cc %>% distinct(location) %>% arrange(location) 

loyalty_loc <- loyalty %>% distinct(location) %>% arrange(location) 

location <- full_join(cc_loc, loyalty_loc, by = "location") %>% 
            arrange(location)

# Replace the unrecognized characters in Katerina's Cafe location in cc & loyalty data
cc[grep("Katerina", cc$location),2] <- "Katerina's Cafe"
loyalty[grep("Katerina", loyalty$location),2] <- "Katerina's Cafe"

5.1.2 Join credit card transactions to loyalty transactions - round 1

Credit card transactions are joined with loyalty card transactions using date, location and spend amount as loyalty card transactions do not have time. Any credit card and loyalty card match that only has 1 transaction will be removed to avoid spurious match.

From the 1,081 matched transactions, there are 56 unique pairs of credit cards and loyalty cards, which points to multiple relationship between credit card and loyalty cards. As it is hard to visualize the interconnection between credit cards and loyalty cards in tabular form, a network graph will be used to visualize the relationship.

hide
# Create new date field for credit card data
cc <- cc %>% 
  mutate(datestamp = as.Date(timestamp))

# Join datasets on exact date, location and price (round 1)
# 1,807 records with spurious matches and non-matches
transact <- cc %>% 
  full_join(loyalty, by = c("datestamp" = "timestamp", "location", "price"))

# Find the unique pairs of credit card and loyalty card 
# 56 pairs of credit card and loyalty card pairs
pairs <- transact %>% 
  group_by(last4ccnum, loyaltynum) %>%
  count() %>%
  # Remove 6 spurious matches
  filter(n>1) %>%
  drop_na() %>%
  arrange(last4ccnum,loyaltynum) %>%
  ungroup()

# 1,081 matched transactions 
transact_match <- transact %>% 
  drop_na() %>%
  left_join(pairs, by = c("last4ccnum","loyaltynum")) %>% 
  drop_na()%>%
  dplyr::select(-n) 

glimpse(transact_match)
Rows: 1,081
Columns: 6
$ timestamp  <dttm> 2014-01-06 08:16:00, 2014-01-06 13:27:00, 2014-0~
$ location   <chr> "Brew've Been Served", "Abila Zacharo", "Brew've ~
$ price      <dbl> 14.97, 50.14, 11.92, 45.05, 7.26, 50.36, 62.20, 1~
$ last4ccnum <fct> 1286, 1286, 1286, 1286, 1286, 1286, 1286, 1286, 1~
$ datestamp  <date> 2014-01-06, 2014-01-06, 2014-01-07, 2014-01-07, ~
$ loyaltynum <fct> L3572, L3288, L3572, L3288, L3572, L3288, L3288, ~

5.1.2 Visualize credit card and loyalty cards relationship using network graph

From the network graph, we observe unexpected relationships between credit card 1286 and loyalty card L3288 and loyalty card L6267 is tied to two credit cards 6691 and 6899.

hide
# Prepare cc & loyalty matching data for network graph format
last4ccnum <- transact_match %>%
  distinct(last4ccnum) %>%
  rename(label = last4ccnum) %>%
  mutate(group = 'cc')

loyaltynum <- transact_match %>%
  distinct(loyaltynum) %>%
  rename(label = loyaltynum) %>%
  mutate(group = 'loyalty')

nodes <- full_join(loyaltynum, last4ccnum, by = c("label","group"))

nodes <- nodes %>% rowid_to_column("id") 

per_route <- transact_match %>%  
  group_by(loyaltynum, last4ccnum) %>%
  summarise(weight = n()) %>% 
  ungroup()

edges <- per_route %>% 
  left_join(nodes, by = c("last4ccnum" = "label")) %>% 
  rename(from = id)

edges <- edges %>% 
  left_join(nodes, by = c("loyaltynum" = "label")) %>% 
  rename(to = id)

edges <- dplyr::select(edges, from, to, weight) %>% 
  filter(weight > 1) %>% 
  mutate(width = weight/5 + 1)

# Plot network graph to show cc and loyalty cards relationship
netwkgraph <- visNetwork(nodes, edges, width = "100%", 
                         main = list(text = "Relationship Between Credit Card and Loyalty Cards",
 style = "font-family:arial;font-size:20px;")) %>% 
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = list(enabled = TRUE,selected = "1")) %>% 
  visGroups(groupname = "cc", color = "lightblue") %>%    
  visGroups(groupname = "loyalty", color = "orange") %>% 
  visLegend(width = 0.1, position = "right", main = list(text = "Card Types",
 style = "font-family:arial;font-size:15px;"))
netwkgraph

5.1.3 Join unmatched credit card transactions to loyalty transactions - round 2

409 loyalty card transactions and 311 credit card transactions were unpaired based on exact date, spend and location. Based on the pairs of credit card and loyalty numbers, we match the remaining unmatched transactions based on date, location and a price range, specifically with loyalty price lower than credit card price.

Results below show that 219 transactions have loyalty card price lower than credit card prices. The lower amount registered on the loyalty card usually ranges from $20-80, in increments of $20s. There is no obvious bias to any location or cards. This suggests that there might be a systemic program or error that led to this outcome e.g. cashback program.

hide
# Create config table based on primary matched pairs of loyalty and credit card numbers detected from network graph 
config <- pairs %>%
  filter(!(last4ccnum %in% c('1286') & loyaltynum %in% c('L3288'))) 

# 1,066 transactions with exact match on date, loyalty number, price and location
exact_match <-  cc %>% 
                left_join(config, by = "last4ccnum") %>% 
                inner_join(loyalty, by = c("datestamp" = "timestamp",
                                           "loyaltynum",
                                           "location","price"))

# 409 cc transactions exclude matches from exact price match
cc_unmatched <- cc %>%
  anti_join(transact_match, by = c("datestamp","last4ccnum","location","price")) 

# 311 loyalty transactions exclude matches from exact price match
loyalty_unmatched <- loyalty %>%
  anti_join(transact_match, by = c("timestamp" = "datestamp","loyaltynum","location","price")) 

# Fuzzy match with loyalty price to be lower or equal to cc price returns 496 entries (round 2)
transact_match_2 <- cc_unmatched %>%
  left_join(config, by = "last4ccnum") %>%
  fuzzy_full_join(loyalty_unmatched, 
                  by = c("loyaltynum", "location",
                           "datestamp"="timestamp","price"="price"),
                  match_fun=list(`==`, `==`, `==`, `>=`)) 

# 219 transactions match with price range with price difference in increments of $20.
transact_match_2 <- transact_match_2 %>%
  drop_na() %>%
  mutate(price_diff = price.x - price.y) %>%
  filter(round(price_diff,0) %in% c(20,40,60,80)) 

# 219 transactions match with price range with price difference in increments of $20.
cashback <- ggplot(transact_match_2, 
            aes(x = price_diff, fill = location.x, 
                text = paste('Price Difference: $', price_diff,
                         '<br>Location: ', location.x,
                         '<br>Count: ', n))) + 
  geom_histogram(boundary = 1, position = "stack") +
  labs(title = "Distribution of Cashback by Locations",
                   x = "Cashback Amount ($)", y = "Count", fill = "Locations") +
  theme_minimal() +
  theme(text = element_text(size=7))
  
ggplotly(cashback, tooltip = c("text"))  

5.1.4 Join unmatched credit card transactions to loyalty transactions - round 3

The remaining unmatched transactions are matched based on location and price (and not date) to check for any delayed postings of credit card transactions. Results below show that 7 Kronos Mart credit card transactions were posted 1 day later than its associated loyalty card transactions.

hide
# 190 cc transactions exclude matches from match round 2 (cashback)
cc_unmatched_2 <- cc_unmatched %>%
  anti_join(transact_match_2, by = c("datestamp", 
                                  "location" = "location.x",
                                  "price" = "price.x",
                                  "last4ccnum")) 

# 92 loyalty transactions exclude matches from match round 2 (cashback)
loyalty_unmatched_2 <- loyalty_unmatched %>%
  anti_join(transact_match_2, by = c("timestamp" = "timestamp.y", 
                                     "loyaltynum" = "loyaltynum.y",
                                     "location" = "location.y",
                                     "price" = "price.y")) 

# Match remaining credit card and loyalty card transactions on location and price (round 3)
transact_match_3 <- full_join(cc_unmatched_2, loyalty_unmatched_2, by = c("location","price"))

# 7 Kronos Mart posts credit card transactions posted 1 day later than loyalty card date
transact_match_3 <- transact_match_3 %>% 
  drop_na() %>%
  mutate(diff_date = datestamp - timestamp.y)
        
# Plot scatterplot of Kronos Mart delayed postings
kronos_delay <- ggplot(transact_match_3 %>%
  mutate(CreditCard = as_date(timestamp.x)) %>%
  dplyr::select(price, location, CreditCard, timestamp.y) %>% 
  rename(LoyaltyCard = timestamp.y) %>%
  pivot_longer(!c(location,price), names_to = "type", values_to = "date"), 
                       aes(x=date, y = price, color = type, label = price)) +
  geom_point() +
  geom_text(check_overlap = TRUE, size = 2, nudge_x = 0.5) +
  scale_x_date(date_label = "%a \n%d %b") +  
  theme_minimal() +
  theme(text = element_text(size=7))+
  labs(title = "Daily Transactions of Kronos Mart by Card Type",
                   x = "Date", y = "Price ($)", color = "Card Type")

ggplotly(kronos_delay)

5.1.5 Join unmatched credit card transactions to loyalty transactions - round 4

The 4th attempt to match the remaining unmatched transactions based on location and price range, whereby loyalty card price will be greater than credit card price, did not return many matches or a pattern. We will disregard the matches and treat the remainder 183 credit card and 85 loyalty card transactions as situations whereby employees used one of the two cards.

hide
# 183 cc transactions exclude matches from match round 3 (delayed posting)
cc_unmatched_3 <- cc_unmatched_2 %>%
  anti_join(transact_match_3, by = c("datestamp", 
                                  "location",
                                  "price",
                                  "last4ccnum")) 

# 85 loyalty transactions exclude matches from match round 3 (delayed posting)
loyalty_unmatched_3 <- loyalty_unmatched_2 %>%
  anti_join(transact_match_3, by = c("timestamp" = "timestamp.y", 
                                     "loyaltynum" = "loyaltynum",
                                     "location",
                                     "price")) 

# Match with loyalty price to be greater or equal to cc price (round 4)
transact_match_4 <- left_join(cc_unmatched_3,config, by = "last4ccnum") %>%
  fuzzy_full_join(loyalty_unmatched_3, by = c("location",
                                              "price",
                                              "datestamp"="timestamp",
                                              "loyaltynum"),
                  match_fun=list(`==`, 
                              `<=`, 
                              `==`,
                              `==`)) %>%
  drop_na()

DT::datatable(transact_match_4, filter = 'top', width = '100%', options = list(scrollX = TRUE))

5.1.6 Final list of transaction records

The final list of spend transactions yielded 1,575 records, which will be explored for patterns and anomalies.

hide
transact_match_clean <- transact_match %>%
  mutate(datestamp_loyalty = datestamp, price_diff = NA, diff_date = NA)

transact_match_2_clean <- transact_match_2 %>%
  dplyr::select(timestamp.x,location.x,price.x,last4ccnum,datestamp,loyaltynum.x,timestamp.y, price_diff) %>%
  rename(timestamp = timestamp.x,
         location = location.x,
         price = price.x,
         loyaltynum = loyaltynum.x,
         datestamp_loyalty = timestamp.y) %>%
  mutate(diff_date = NA)

transact_match_3_clean <- transact_match_3 %>%
  dplyr::select(timestamp.x,location,price,last4ccnum,datestamp,loyaltynum,timestamp.y,diff_date) %>%
  rename(timestamp = timestamp.x, datestamp_loyalty = timestamp.y) %>%
  mutate(price_diff = NA)

loyalty_unmatched_3_clean <- loyalty_unmatched_3 %>%
  rename(datestamp = timestamp) %>%
  mutate(datestamp_loyalty = datestamp)

# Final set of 1,575 credit card and loyalty card transactions
trans_final <- transact_match_clean %>%
  bind_rows(transact_match_2_clean) %>%
  bind_rows(transact_match_3_clean) %>% 
  bind_rows(loyalty_unmatched_3_clean) %>% 
  bind_rows(cc_unmatched_3)

DT::datatable(trans_final, filter = 'top', width = '100%', options = list(scrollX = TRUE))

5.1.7 Understand hourly spend patterns

From the heatmap below, we observed that some locations only have credit card transactions posted in 1 hour within the day e.g. 

Another anomaly that we will investigate is the posting from Kronos Mart at 3AM.

hide
# Summarize credit card data by hour
cc_hour <- trans_final %>% 
  drop_na(timestamp) %>%
  mutate(hour_cc = as.numeric(get_hour(timestamp))) %>% 
  group_by(location,hour_cc) %>% 
  count() %>%
  rename("transactions" = n) %>%
  ungroup() %>%
  complete(location, hour_cc = full_seq(hour_cc, period = 1)) %>%
  mutate(hour_cc = as.character.numeric_version(hour_cc)) 

# Plot cc hourly data as heatmap using geom_tile()
heatmap_cc <- ggplot(cc_hour, aes(x= hour_cc, y = location, fill = transactions, 
                                  text = paste('No. of Transactions:', transactions,
                                              '<br>Transaction Hour: ', hour_cc,
                                              '<br>Location: ', location))) +
  theme_minimal() +
  theme(text = element_text(size=7))+
  geom_tile(colour="white") +
  labs(x = "Hour", y = "Location") +
  scale_fill_gradient(low = "#E0E1E4", high = "#000000") +
  scale_y_discrete(limits = rev) +
  labs(title = "Hourly Credit Card Transactions",
      x = "Hour", fill = "Transactions") 

# Make heatmap interactive with plotly for exploration
ggplotly(heatmap_cc, tooltip = c("text"))

5.1.8 Understand daily spend patterns

From the heatmap below, we see that the top 3 popular locations belong to the Food categories, such as Katerina’s Cafe, Brew’ve Been Served and Hippokampos. Katerina’s Cafe peaks on Saturday, Hippokampus is busier on weekdays than weekends and Brew’ve been Served is closed on weekends.

hide
# Summarize spend data by day
trans_day <- trans_final %>% 
  mutate(date_min = pmin(datestamp, datestamp_loyalty, na.rm=TRUE)) %>%
  group_by(location,date_min) %>% 
  count() %>%
  rename("transactions" = n) %>%
  ungroup() 

break_vec <- trans_day$date_min %>% c(seq(from = min(trans_day$date_min), to = max(trans_day$date_min),by = "day"))

# Plot data as heatmap using geom_tile()
heatmap_spend <- ggplot(trans_day, aes(x= date_min, y = location, fill = transactions, 
                                  text = paste('No. of Transactions:', transactions,
                                              '<br>Transaction Date: ', date_min,
                                              '<br>Location: ', location))) +
  theme_minimal() +
  theme(text = element_text(size=7), axis.text.x=element_text(hjust=1)) +
  geom_tile(colour="white") +
  labs(x = "Date", y = "Location") +
  scale_fill_gradient(low = "#E0E1E4", high = "#000000") +
  scale_y_discrete(limits = rev) +
  scale_x_date(breaks = break_vec, date_label = "%a \n%d %b") +  
  labs(title = "Daily Spend Transactions",
      x = "Date", fill = "Transactions") 

# Make heatmap interactive with plotly for exploration
ggplotly(heatmap_spend, tooltip = c("text"))

5.1.9 Understand spend amount

The highest spend on the credit card is $10k on Frydos Autospply n’ More that does not have a loyalty card transaction, belonging to credit card 9551. 9 credit cards (2276, 3506, 4530, 8642, 7792, 9152, 9220, 9614, 9735) register many high spends that are at industrial areas, likely belonging to the 9 truck drivers.

hide
spend_cc <-  ggplot(trans_final, aes(x = last4ccnum, y = location, size = price*2, color = location, 
                                     text = paste('Last 4 Credit Card Number:', last4ccnum,
                                                  '<br>Loyalty Card Number:', loyaltynum,
                                              '<br>Price: $', price,
                                              '<br>Date: $', datestamp,
                                              '<br>Location: ', location))) +
  geom_point(alpha = 0.7) + 
  theme_minimal() + 
  theme(text = element_text(size=7), axis.text.x=element_text(angle=45, hjust=1)) +
  scale_y_discrete(limits = rev) +
  labs(title = "Spending at Each Location by Credit Card Number",
      x = "Credit Card Number", y = "Location", color = "Location", size = "Price") 

ggplotly(spend_cc, tooltip = c("text"))

5.1.10 Understand Credit Card Spend Frequency

From the chart below, we spot anomalies from the frequency of the spends worth investigating: - Credit card 9551 saw a peak spend of more than $10k with 5 transactions on 13 Jan and registering no spend for 2 days thereafter - Early ends - Credit card 5921’s last spend was on 19 Jan - Late starts - Credit card 3547 started having spend on 12 Jan and 5010 started on 17 Jan - Sporadic records with high spends - Credit cards 2276, 3506, 4530, 7792, 8642 9152, 9220, 9614

hide
ly_count <- trans_final %>%
  group_by(datestamp, last4ccnum) %>%
  summarise(n = n(), total_price = sum(price)) %>% 
  drop_na() %>% ungroup()

ly_count_p <- ggplot(ly_count, aes(x = datestamp, y = n, fill = total_price, 
                                   text = paste('Total Spend $', total_price,
                                              '<br>Frequency:', n,
                                              '<br>Date:', datestamp))) + 
  geom_bar(stat='identity') +
  theme_minimal() +
  theme(axis.title.x = element_blank()) +
  theme(text = element_text(size=7), axis.text.x=element_text(angle=45, hjust=1)) +
  facet_wrap(~last4ccnum) +
  theme(panel.spacing = unit(1, "lines")) +
  labs(title = "Daily Transaction Frequency By Credit Card Number",
      fill = "Total Spend ($)") 

ggplotly(ly_count_p, tooltip = c("text"))

5.2 GPS Logs and Map Layers

5.2.1 Set up map layers using the tourist map provided

We plot the raster layer using tmap to merge the rgb bands and use sf to import the vector GIS files into ESRI shapefile format.

hide
# Import MC2-tourist.tif created using QGIS
ap <- raster("data/Geospatial/MC2-tourist.tif")

# Plotting raster layer using tmap to merge rgb bands
tm_shape(ap) +
tm_rgb(ap, r=1,g=2,b=3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255)
hide
# Use sf to import Vector GIS data files in ESRI shapefile format
abila_st <- st_read(dsn = "data/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source `C:\kpokp\blog_ISSS608\_posts\2021-07-25-individual-assignment-part-2\data\Geospatial' using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84

5.2.2 Create movement path from GPS logs

We then create movement path by converting the aspatial data into a data frame and group the paths by id.

hide
# Converting aspatial data into simple feature data frame using st_as_sf
gps_sf <- st_as_sf(gps,
                   coords = c("long","lat"),
                            # EPSG 4326 stands for wgs84 geo. coord sys
                            crs = 4326)

# Group paths by ID and cast as linestring
gps_path <- gps_sf %>%
            group_by(id) %>%
            # Due to peculiar nature of group by, we need to do an action such as summarize
            summarize(m = mean(Timestamp),
                      do_union = FALSE) %>%
            st_cast("LINESTRING") %>%
            ungroup()
# Have to slice the paths by days, or time period to derive meaningful paths

5.2.3 Visualize gps paths by Car ID to detect gps anomalies

Results from the maps show that Car ID 28 and 9 have GPS data issues:

hide
# Zoom in on car 28 & 9
gps_path_selected <- gps_path %>% filter(id %in% c("28","9"))

tmap_mode("view")

tm_shape(ap) +
  tm_rgb(ap, r=1,g=2,b=3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
tm_shape(gps_path_selected) +
  tm_lines() +
tm_facets(by = "id", ncol = 2)

5.2.4 Resolve data issues in Car 28 GPS log by taking the average coordinates over a wider time interval

Car ID 28’s jittery GPS logs resolved by taking the mean coordinates over a 20 second interval.

hide
# Change GPS timestamp to POSIX format
gps$Timestamp <- as.POSIXct(gps$Timestamp, format = "%m/%d/%Y  %H:%M:%S")

# Group timestamp in intervals of 20 seconds and select the mean coordinates to smoothen out the path
gps_28 <- gps %>%
  filter(id == "28") %>%
  mutate(Timestamp = round_date(Timestamp, "20 sec")) %>%
  mutate(lat = round(lat - 00.004,3) ) %>% 
  mutate(long = round(long + 00.004,3) ) %>%
  group_by(Timestamp) %>%
  summarise(lat = mean(lat, na.rm = FALSE), long = mean(long, na.rm = FALSE)) %>% 
  mutate(id = "28") %>%
  ungroup() 

# Visualize car 28 GPS movement
gps_sf_28 <- st_as_sf(gps_28,
                   coords = c("long","lat"),
                            # EPSG 4326 which stands for wgs84 geo. coord sys
                            crs = 4326)

gps_path_28 <- gps_sf_28 %>%
  group_by(id) %>%
  # Due to peculiar nature of group by, we need to do an action such as summarize
  summarize(m = mean(Timestamp),
            do_union = FALSE) %>%
  st_cast("LINESTRING") %>%
  ungroup()

tmap_mode("view")

tm_shape(ap) +
  tm_rgb(ap, r=1,g=2,b=3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
tm_shape(gps_path_28) +
  tm_lines()

5.2.5 Resolve data issues in Car 9 GPS log

Car ID 9 missing GPS records seems to occur randomly. Unfortunately, we are unable to extrapolate the missing GPS records with safe assumptions. We would take note to register events of Car ID 9 by taking the location when the car stopped or start, whichever that makes more sense (i.e. closer to a meaningful location).

hide
# Understand car ID 9's GPS pattern
gps_9 <- gps %>% 
  mutate(datestamp = as_date(Timestamp)) %>%
  mutate(time = as.POSIXct(Timestamp,format = "%H:%M:%S")) %>%
  filter(id == "9", datestamp == "2014-01-07") 

# Plot GPS pattern in scatterplot chart
gps_9_line <- gps_9 %>%
  ggplot(aes(x = time, y = lat)) +
  geom_point() +
  geom_line(color = "red") +
  facet_grid(row = vars(datestamp), scales = "free") +
  labs(title = "Latitude Coordinates of Car ID 9",
                   x = "Time", y = "Latitude")  

ggplotly(gps_9_line)  

5.2.6 Detect and process events from GPS log

Stops between GPS logs provides clues to where the person was heading to. This study will define all stops between GPS log records that spans above 5 minutes as events to be investigated. Each event will have a time span, a location, and a person tied to it. As a rule of thumb, homes of employee will be first identified as locations with pauses between GPS logs exceeding 5 hours for further investigation.

hide
# Replace gps records of Car ID 28
gps_28 <- gps_28 %>%
  dplyr::select(Timestamp, id, lat, long)

gps_clean <- gps %>%
  filter(!id == "28") %>%
  rbind(gps_28) 

# Make car assignment table neat
cars_neat <- cars %>% 
  unite("name", FirstName, LastName, sep = " ") %>%
  rename(department = CurrentEmploymentType , title = CurrentEmploymentTitle)

# Merge car assignments to gps data & sort table by id & timestamp
gps_name <- gps_clean %>%
  left_join(cars_neat, by = c("id" = "CarID")) %>%
  arrange(id, Timestamp) %>%
  mutate(Timestamp = as.POSIXct(Timestamp, format = "%m/%d/%Y  %H:%M:%S"))

# Find difference between timestamp and id
gps_name_temp <- dplyr::select(gps_name,Timestamp)

time_diff <- tail(gps_name_temp,-1) - head(gps_name_temp,-1)

# Join & Tag Start and Stop for Intervals > 5 min
gps_name_diff <- bind_cols(time_diff$Timestamp,tail(gps_name,-1)) %>%
  rename(diff=...1) %>%
  group_by(id) %>%
  
  # If interval btw timestamps is more than 5 mins, tag as an event
  mutate(event_stop = ifelse(diff/60 > 5,'stop',
                          ifelse(diff/60 < 0,'stop',NA))) %>%
  mutate(event_start = lead(event_stop)) %>% 
  
  # If interval between timestamps is more than 5 hours, tag as home to help identify home locations later
  mutate(home = ifelse(diff/60/60 > 5,'home',NA)) %>%
  mutate(diff = seconds_to_period(diff)) %>%
  mutate(event_start = ifelse(event_start == 'stop','start',NA)) %>%
  
  # Encode lat long into geohash level 7    
  mutate(geohash = gh_encode(latitude = lat, 
                               longitude = long, 
                               precision = 7L)) %>%
  ungroup() 

5.2.7 Identify events of Car ID 9 and 28

There is a need to consciously merge the start and stop locations of Car ID 9 as there are places of the interest that are registered by starts (e.g. Bean There Done That) and places registered by stops (e.g. U Pump). We will do the event patching if events identified cannot be matched to a location.

hide
# Inspect events of Car ID 9
events_9 <- gps_name_diff %>% 
          filter(id == "9", event_stop == "stop" | event_start == "start") %>%
          unite(event, c("event_start", "event_stop")) %>%
          dplyr::select(Timestamp, lat, long, event)

# view the locations over the jpeg map
map <- ggplot(events_9, aes(long, lat)) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add points
    geom_point(aes(color=event), size = 3, alpha = 0.8) +
  
  # Add label
    labs(title = "Start and Stop Coordinates of Car ID 9 Events",
                   x = "Longitude", y = "Latitude", color = "Type of Events")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)
layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))
hide
# Inspect events of Car ID 9
events_28 <- gps_name_diff %>% 
          filter(id == "28", event_stop == "stop" | event_start == "start") %>%
          unite(event, c("event_start", "event_stop")) %>%
          dplyr::select(Timestamp, lat, long, event)

# view the locations over the jpeg map
map <- ggplot(events_28, aes(long, lat)) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add points
    geom_point(aes(color=event), size = 3, alpha = 0.8) +
  
  # Add label
    labs(title = "Start and Stop Coordinates of Car ID 28 Events",
                   x = "Longitude", y = "Latitude", color = "Type of Events")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)
layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))

5.2.8 Capture all GPS events

3,067 events with locations that would be later identified as home, places of interest, or unknown/suspicious locations.

hide
# 3,067 events that span more than 5 minutes
events <- gps_name_diff %>%
  filter(event_stop == "stop") %>%
  mutate(event_start_time = Timestamp - diff) %>%
  rename(event_stop_time = Timestamp) %>% 
  mutate(event_date = as_date(event_start_time)) %>% 
  dplyr::select(event_stop, home,event_date, event_start_time, event_stop_time,
                        diff, lat, long, geohash, id, name, department, title) %>%
  filter(diff > 0)

glimpse(events)
Rows: 3,067
Columns: 13
$ event_stop       <chr> "stop", "stop", "stop", "stop", "stop", "st~
$ home             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
$ event_date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-0~
$ event_start_time <dttm> 2014-01-06 07:22:04, 2014-01-06 08:04:09, ~
$ event_stop_time  <dttm> 2014-01-06 07:57:01, 2014-01-06 12:17:01, ~
$ diff             <Period> 34M 57S, 4H 12M 52S, 58M 34S, 4H 9M 34S,~
$ lat              <dbl> 36.06371, 36.04802, 36.07668, 36.04803, 36.~
$ long             <dbl> 24.88593, 24.87957, 24.85757, 24.87957, 24.~
$ geohash          <chr> "sw3tnm1", "sw3tn4k", "sw3tjxs", "sw3tn4k",~
$ id               <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ name             <chr> "Nils Calixto", "Nils Calixto", "Nils Calix~
$ department       <chr> "Information Technology", "Information Tech~
$ title            <chr> "IT Helpdesk", "IT Helpdesk", "IT Helpdesk"~

5.2.9 Identify home locations of employees

The interactive map shows that some of the employees stay together e.g. employees with Car ID 30, 22 and 23.

Furthermore, due to GPS issues, employee with Car ID 28 has home locations spread around a concentrated area, while employee with Car ID 9 spreads out over a wider area.

We also picked up GASTech as a location especially for truck drivers, who park their vehicles at the company at the end of their work day.

hide
# Identify homes or place of accommodation of employees - 500 events
home <- events %>% 
  filter(home == "home") 

home_freq <- home %>%
  dplyr::select(id, lat, long, geohash) %>%
  mutate(lat = round(lat,3), long = round(long,3)) %>%              
  group_by(id, lat, long, geohash) %>%
  summarise(n = n()) %>%
  ungroup()

# view the home locations over the jpeg map
map <- ggplot(home_freq, aes(long, lat, text = paste("geohash:", geohash))) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add points
    geom_point(aes(color=id, size = n), alpha = 0.8) +
  
  # Add label
    labs(title = "Likely Home Locations by Car IDs",
                   x = "Longitude", y = "Latitude", color = "Car ID", size = "Size of Bubbles - Frequency")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)
layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))

5.2.10 Identify likely geohash of employees’ homes locations

We will pick out the geohash of the employees’ homes based on the high frequency stays over the 2 weeks.

hide
# Home location config table
location_labels_home <- home_freq %>% 
  group_by(id) %>%
  arrange(desc(n)) %>%
  slice_head() %>% 
  ungroup() %>%
  dplyr::select(geohash, id) %>%
  group_by(geohash) %>% 
  mutate(location = paste0(id, collapse = "/")) %>%
  arrange(geohash) %>%
  slice_head() %>% 
  mutate(label = "Home") %>%
  mutate(location = paste(label,location, sep = " ")) %>%
  dplyr::select(!c(id,label)) %>%
  ungroup()
  
location_labels_home <- home %>% 
  left_join(location_labels_home, by = "geohash") %>%
  drop_na() %>% 
  group_by(location) %>%
  arrange(location) %>%
  slice_head() %>%
  ungroup() %>% 
  mutate(category = "Home") %>%
  dplyr::select("lat","long","geohash","location","category") %>%
  mutate(location = ifelse(geohash == "sw3tn4k", "GASTech", 
                                  ifelse(geohash == "sw3tnw2", "Chostus Hotel",location))) %>%
  mutate(category = ifelse(location == "GASTech", "Company", 
                                  ifelse(location == "Chostus Hotel", "Leisure","Home"))) 

# view the POIs over the jpeg map
map <- ggplot(location_labels_home, aes(long, lat, text = paste('Geohash:', geohash,
                         '<br>Latitude: ', lat,
                         '<br>Longitude: ', long,
                         '<br>Location: ', location))) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add points
    geom_point(aes(color=location), alpha = 0.8, size =4) +
  
  # Add label
    labs(title = "Accommodation Locations by Car IDs",
                   x = "Longitude", y = "Latitude", color = "Accommodation")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map, tooltip = "text")
layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))

5.2.11 Identify relationships of employees who share homes

From the parallel coordinate plot, we see that 8 of the 15 employees who stay in shared homes are from the Security department. 2 of the homes - Home 13/15/16/21 and Home 22/23/30 are occupied only by employees of the Security department. Home 13/15/16/21 occupants take up Perimeter and Site Control duties while Home 22/23/30 occupants are from the Badging Office and one of them is a Security Group Manager. The other 2 homes make up of employees from varied departments such as Engineering, Facilities, IT and Security IT.

hide
# Prepare datatset for parallel coordinates plot
cars_home <- cars_neat %>%
  filter(CarID %in% c("22","23","30","6","25","29","17","24","33","13","15","16","21","14","18")) %>%
  mutate(location = ifelse(CarID %in% c("22","23","30"),"Home 22/23/30",
                           ifelse(CarID %in% c("6","25","29"),"Home 6/25/29",
                           ifelse(CarID %in% c("14","18"),"Home 14/18",
                           ifelse(CarID %in% c("17","24","33"),"Home 17/24/33",
                           ifelse(CarID %in% c("13","15","16","21"),"Home 13/15/16/21",NA)))))) %>%
  rename(a.CarID = CarID, b.name = name, c.location = location, d.department = department, e.title = title) %>%
  group_by(b.name, a.CarID, d.department, e.title, c.location) %>%
  summarise(freq = n()) %>%
  gather_set_data(1:5) %>%
  ungroup()

# Plot parallel coordinate plot
home_para_cord <- ggplot(cars_home, aes(x, id = id, split = y, value = freq)) +
  geom_parallel_sets(aes(fill = c.location), alpha = 0.3, axis.width = 0.1) +
  geom_parallel_sets_axes(axis.width = 0.1, fill = "white", colour = "lightgrey") +
  geom_parallel_sets_labels(colour ="black",angle = 360,size = 3) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        axis.text.y = element_blank(),
        axis.line = element_blank()) +
 # Add label
  labs(title = "Characteristics of Employees Sharing Homes",
       x = "Characteristics", fill = "Home Locations") 
  #facet_grid(~location)

home_para_cord

5.2.12 Identify anomalies from home stays

From the interactive timeline, we observe some activities that stand out for further investigation: - ID 1 was not home for 2 nights (6 Jan and 8 Jan) - ID 16 was staying overnight in Aliba starting 7 Jan - ID 21 visited 2 homes from 10 Jan (Home 21 and 35) and visited an unknown location subsequently. Home location was also not detected on night of 14 Jan. - ID 22 and 28 visited unknown locations in the afternoon of 18 Jan - ID 24 was not home on 9 Jan - ID 25 visited an unknown location on 18 Jan that lasted for almost 24 hours - ID 31 was only in Aliba for 4day-3nights (16 to 19 Jan) and stayed at the Chostus hotel - Selected trucks were only parked at GASTech building from 6 Jan to 17 Jan, and they are parked at the GASTech building

hide
# Match home events to home config table based on 150m radius and no duplicate was introduced - 494 events matched, 6 unmatched
# Unknown locations will be investigated
home_matched <- home %>%
  geo_left_join(location_labels_home, by = c("lat", "long"), max_dist = 0.15, distance_col = "distance", unit = c("km")) %>% 
  # Assume employee with car ID 9 to be home
  mutate(location = ifelse(id == 9, "Home 9",location)) %>%
  mutate(category = ifelse(id == 9, "Home",location)) 
  
# Plot timeline data as gantt chart
home_timeline <- ggplot(home_matched,
            aes(text = paste('id:', id,
                         '<br>Event Start Time: ', as.POSIXct(event_start_time, format = "%m/%d/%Y  %H:%M:%S"),
                         '<br>Event Stop Time: ', as.POSIXct(event_stop_time, format = "%m/%d/%Y  %H:%M:%S"),
                         '<br>Location: ', location))) +
  geom_linerange(mapping=aes(y=id, xmin=event_start_time, xmax=event_stop_time, color=location),size=I(1)) +
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", date_label = "%a \n%d %b") + 
  theme_minimal() +
  theme(axis.text.x=element_text(hjust=1)) +
  # Add label
  labs(title = "Home Locations by Car IDs",
       x = "Date", y = "Car ID", color = "Home Locations") 
  # Can add highlight when do not need tooltip
  # gghighlight(is.na(location))


# Make gantt chart interactive
ggplotly(home_timeline, tooltip = c("text"))

5.3 Combine Spending Transactions with Events Detected from GPS

5.3.1 Find out the owners of the credit cards based on X percentage match with GPS events

If the transaction time is within the time span of an event, it is assigned to that event

hide
# Select events that are not staying at home by excluding home - 2,567 events
POI <- events %>%
  filter(is.na(home)) 

# Match POI events with home locations to narrow list of locations to match
POI_matched_home <- POI %>%
  left_join(location_labels_home, by = "geohash")

# Add the unknown locations from home events match to the rest of the events - 7 events
unknown_home_location <- home_matched %>% filter(is.na(location)) %>%
  dplyr::select(event_stop,home,event_date, event_start_time,event_stop_time,diff,lat.x,long.x,geohash.x,id,name,department,title) %>%
  rename(geohash = geohash.x)

# List of events to be matched
POI_2 <- POI_matched_home %>%
  filter(is.na(location)) %>%
  bind_rows(unknown_home_location)

# Select distinct timestamp with each location from transactions
trans_time <- cc %>%
  dplyr::select(location, timestamp, last4ccnum) %>%
  distinct(location, timestamp, last4ccnum) 

# Remove locations that coincided with Home locations and add the unknown events from Home matching              
# Perform fuzzy match of rest of the events and transactions by time windows
POI_matched <- POI_2 %>%
  fuzzy_left_join(trans_time,
                  by = c("event_start_time" = "timestamp", "event_stop_time" = "timestamp"),
                  match_fun = list(`<`,`>`)) 

# Prepare the matched data for visualization - 996 records 
# We would not pick the top matches as some locations are less visited.
POI_matched_2 <- POI_matched %>% 
  dplyr::select(!home) %>% 
  drop_na(location.y) %>% 
  # Round the lat long to reduce effect of spotty gps
  mutate(lat = round(lat.x,3), long = round(long.x,3)) %>% 
  # Remove Car ID 9 due to spotty GPS
  filter(!id == "9") %>%
  dplyr::select(location.y, lat, long, geohash) %>% 
  group_by(location.y, lat, long, geohash) %>% 
  summarise(n = n()) %>%
  ungroup() %>% 
  arrange(location.y, desc(n)) %>% 
  group_by(location.y) %>%
  ungroup() 

glimpse(POI_matched_2)
Rows: 702
Columns: 5
$ location.y <chr> "Abila Airport", "Abila Airport", "Abila Airport"~
$ lat        <dbl> 36.077, 36.063, 36.060, 36.066, 36.052, 36.054, 3~
$ long       <dbl> 24.858, 24.851, 24.858, 24.852, 24.871, 24.900, 2~
$ geohash    <chr> "sw3tjxs", "sw3tjmp", "sw3tjsm", "sw3tjt8", "sw3t~
$ n          <int> 49, 43, 35, 30, 29, 29, 27, 19, 13, 10, 9, 9, 6, ~

5.3.2 Plot the events that coincide with transactions to determine the lat long of POIs

The interactive map shows locations where credit card and gps locations coincide. We notice that the highest frequency matches need not coincide with the POI placement on the map. As such, we visually pick out the geohash of each POI based on the the description on the map, coupled with the highest occurrence.

hide
# view the POIs over the jpeg map
map <- ggplot(POI_matched_2, aes(long, lat, text = paste("geohash:", geohash))) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add points
  geom_point(aes(color=location.y, size = n), alpha = 0.3) +
  
  # Add label
  labs(title = "Likely POI Locations",
                   x = "Longitude", y = "Latitude", color = "Location", size = "Size of Bubbles - Frequency")
 
# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)
layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.5,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))

Most likely geohashes identified per location:

5.3.3 Create a config table that holds the likely geohashes of each POI

hide
# Create a config table of the POI with its most likely geohash
POI_geohash <- tibble(location = c("Abila Airport","Abila Scrapyard","Abila Zacharo","Ahaggo Museum","Albert's Fine Clothing",
                                   "Bean There Done That","Brew've Been Served","Carlyle Chemical Inc.","Coffee Cameleon",
                                   "Desafio Golf Course","Frank's Fuel","Frydos Autosupply n' More",
                                   "Gelatogalore","General Grocer","Guy's Gyros","Hallowed Grounds","Hippokampos",
                                   "Jack's Magical Beans","Kalami Kafenion","Katerina's Cafe","Kronos Mart","Kronos Pipe and Irrigation",
                                   "Maximum Iron and Steel","Nationwide Refinery","Octavio's Office Supplies","Ouzeri Elian","Roberts and Sons",
                                   "Shoppers' Delight","Stewart and Sons Fabrication","U-Pump","Kronos Capitol"),
                      geohash_likely = c("sw3thfv","sw3tjrh","sw3tjmp","sw3tnpd","sw3tjx7",
                                        "sw3tm2r","sw3tnek","sw3tnhm","sw3tn7s",
                                        "sw3tm9y","sw3tjq8","sw3tnes",
                                        "sw3tjsm","sw3tjse","sw3tnev","sw3tnm1","sw3tjxs",
                                        "sw3tnjb","sw3tjt8","sw3tnee","sw3tjmx","sw3tjj6",
                                        "sw3tjm2","sw3tnk1","sw3tjsk","sw3tjgn","sw3tjt0",
                                        "sw3tjgh","sw3tng3","sw3tjvy","sw3tj7n")) 

POI_geohash_2 <- POI_matched_2 %>%
  left_join(POI_geohash, by = c("geohash" = "geohash_likely")) %>% 
  group_by(location) %>%
  arrange(desc(n)) %>%
  slice_head() %>%
  ungroup() %>%
  dplyr::select(lat, long, geohash, location) %>%
  drop_na() %>%
  
  # as the cc timing did not match gps due to cc posting issues or non-spend locations
  add_row(lat = 36.073, long = 24.864, geohash = "sw3tjycr", location = "Brewed Awakenings")
  # add_row(lat = 36.066, long = 24.850, geohash = "sw3tjmx", location = "Kronos Mart")

# Assign categories to add semantics to points-of-interest (POI)
food <- c("Abila Zacharo","Bean There Done That","Brew've Been Served",
          "Brewed Awakenings","Coffee Cameleon","Coffee Shack","Gelatogalore",
          "Guy's Gyros", "Hallowed Grounds","Hippokampos","Jack's Magical Beans", 
          "Kalami Kafenion", "Katerina's Cafe","Ouzeri Elian")
retail <- c("Albert's Fine Clothing","Daily Dealz","General Grocer","Kronos Mart",
            "Roberts and Sons", "Shoppers' Delight")
gas <- c("Frank's Fuel","U-Pump")
leisure <- c("Abila Airport","Ahaggo Museum","Chostus Hotel","Desafio Golf Course","Kronos Capitol")
company <- c("Abila Scrapyard","Carlyle Chemical Inc.","Frydos Autosupply n' More",
             "Kronos Pipe and Irrigation","Maximum Iron and Steel","GASTech",
             "Nationwide Refinery","Octavio's Office Supplies","Stewart and Sons Fabrication")

# Add categories to config table
location_labels_POI <- POI_geohash_2 %>% mutate(category = 
                    ifelse(location %in% food, "Food",
                    ifelse(location %in% retail, "Retail",
                    ifelse(location %in% gas, "Gas",
                    ifelse(location %in% leisure, "Leisure",
                    ifelse(location %in% company, "Company", NA)))))) 

# Check that POI positions are correctly tagged on the map
map <- ggplot(location_labels_POI, aes(long, lat, text = paste('Category:', category,
                                                        '<br>Location: ', location,
                                                        '<br>Geohash: ', geohash,
                                                        '<br>Lat: ', lat,
                                                        '<br>Long: ', long))) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add points
  geom_point(aes(color= category), alpha = 0.8, size = 3) +
  
  # Add label
  labs(title = "POI Locations", x = "Longitude", y = "Latitude", color = "Category")


# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map, tooltip = c("text"))

layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))

5.3.4 Match the rest of the 2,567 events + 7 unknown events with POIs and home locations

hide
# Create a config table with both POI and home locations
location_labels <- location_labels_POI %>% rbind(location_labels_home) 

# Match rest of events to POI config table based on 120m radius and pick the location that is closest to POI using geo-join
# Unknown locations will be investigated
POI_geo_matched <-  POI_2 %>%
  rename(lat = lat.x, long = long.x) %>%
  dplyr::select(-c(lat.y, long.y, location, category)) %>% 
  geo_left_join(location_labels, by = c("lat", "long"), max_dist = 0.12, distance_col = "distance", unit = c("km")) %>% 
  group_by(id, event_stop_time) %>%
  arrange(desc(distance)) %>%
  slice_head()

5.3.5 Identify unknown events on the map

hide
# Isolate the unknown events
unknown_location <- POI_geo_matched %>% 
  filter(is.na(location)) %>%
  dplyr::select(event_stop,home,event_start_time,event_stop_time,diff,lat.x,long.x,geohash.x,id,name) %>%
  rename(lat = lat.x, long = long.x, geohash = geohash.x)

map <- ggplot(unknown_location, aes(long, lat, text = paste("geohash:", geohash))) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add points
  geom_point(aes(color=id), alpha = 0.8, size = 3) +
  
  # Add label
  labs(title = "Unknown Locations by Car IDs", color = "Car ID")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)

layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))

We observe some patterns from the map above: - Car ID 9’s spotty gps cause events to be identified when there is none. We will ignore the events until we fix the journeys of Car ID 9 with its event start locations. - Excluding events from 9, there are 5 locations that cannot be explained. They are visited by Car ID 13, 15, 21 and 24

5.3.6 Do a match of all events against the 3 catgeories of locations - home, POI and unknown

hide
# Create a config for suspicious locations tagged as unknown category
location_labels_unknown <- unknown_location %>% 
  filter(!id %in% c("9")) %>%
  dplyr::select(lat,long,geohash,id) %>%
  mutate(lat = round(lat,2),long=round(long,2)) %>%
  group_by(lat,long,id) %>%
  slice_head() %>%
  ungroup() %>% 
  group_by(lat,long) %>% 
  mutate(location = paste0(id, collapse = "/")) %>%
  arrange(geohash) %>%
  slice_head() %>% 
  dplyr::select(!c(id,event_stop_time)) %>%
  ungroup() %>%
  mutate(label = "Unknown", category = "Unknown") %>%
  mutate(location = paste(label,location, sep = " ")) %>%
  left_join(unknown_location, by = "geohash") %>%
  dplyr::select(lat.y, long.y, geohash, location, category) %>%
  group_by(geohash,location) %>%
  slice_head() %>%
  ungroup() %>%
  rename(lat = lat.y, long = long.y) 

# Add suspicious locations to the location labels config table
location_labels_all <- location_labels %>% bind_rows(location_labels_unknown)

# Clean home events
home_matched <- filter(home_matched, !is.na(location)) %>%
  dplyr::select(event_stop,home,event_date,event_start_time,event_stop_time,diff,lat.x,long.x,geohash.x,id,name,department,title,location) %>%
  mutate(category = "Home") %>%
  rename(lat=lat.x,long=long.x,geohash=geohash.x) 
# Clean home events
POI_matched_home <- filter(POI_matched_home, !is.na(location)) %>%
  dplyr::select(event_stop,home,event_date,event_start_time,event_stop_time,diff,lat.x,long.x,geohash,id,name,department,title,location,category) %>%
  rename(lat=lat.x,long=long.x)    

# Match 1,429 unmatched POI events that is closest distance to POI to avoid duplicates, 12 unmatched events from Car 9
POI_3 <- POI_2 %>%
  dplyr::select(event_stop,home,event_date,event_start_time,event_stop_time,diff,lat.x,long.x,geohash,id,name,department,title) %>%
  geo_left_join(location_labels_all, by = c("lat.x" = "lat", "long.x" = "long"), max_dist = 0.12, distance_col = "distance", unit = c("km")) %>%
  dplyr::select(event_stop,home,event_date,event_start_time,event_stop_time,diff,lat.x,long.x,geohash.x,id,name,department,title,location,category, distance) %>%
  rename(lat=lat.x,long=long.x,geohash=geohash.x) %>%
  group_by(event_start_time, id) %>%
  arrange(distance) %>%
  slice_head() %>% 
  ungroup()

# Match 12 starting POI events of Car9 that is closest distance to POI to avoid duplicates  
POI_Car9 <- gps_name_diff %>%
  filter((event_start == "start"|event_stop == "stop"), id == "9") %>% 
  left_join(filter(POI_3,is.na(location)), by = c("Timestamp"="event_start_time")) %>% 
  filter(!is.na(geohash.y)) %>%
  dplyr::select(event_stop.y,home.y,event_date, Timestamp, event_stop_time,
                        diff.y, lat.x, long.x, geohash.x, id.y, name.y, department.y, title.y) %>%
  rename(event_stop=event_stop.y,home=home.y,event_start_time=Timestamp,
                        diff=diff.y, lat=lat.x, long=long.x, geohash=geohash.x, id=id.y, name=name.y, department=department.y, title=title.y) %>%
  geo_left_join(location_labels_all, by = c("lat","long"), max_dist = 0.25, distance_col = "distance", unit = c("km")) %>% 
  group_by(event_start_time, id) %>%
  arrange(distance) %>%
  slice_head() %>% 
  ungroup() %>%
  dplyr::select(-c(lat.y,long.y,geohash.y)) %>%
  rename(lat=lat.x, long=long.x,geohash=geohash.x,)

# Match all 3,067 events 
events_matched <- filter(POI_3,!is.na(location)) %>%
  # Add 12 events from Car ID 9  
  bind_rows(POI_Car9) %>%
  # Add 493 homes events 
  bind_rows(home_matched) %>%
  # Add another 1,145 POI events matched to home locations
  bind_rows(POI_matched_home)

# Map all locations
map <- ggplot(location_labels_all, aes(long, lat, text = paste("geohash:", geohash))) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add points
  geom_point(aes(color=location), alpha = 0.8, size = 3) +
  
  # Add label
  labs(title = "Locations", color = "Location")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)

layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))

5.3.7 Plot timeline of events to check if all events are captured

hide
# Change data into long form for line chart
events_matched_long <- events_matched %>% 
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp")

events_matched_long_highlight <- highlight_key(events_matched_long)

# Plot timeline data as gantt chart
events_timeline <- ggplot(events_matched_long_highlight, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=category,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs", y = "Car ID", x = "Date", color = "Location \nCategory") 
  

# Make gantt chart interactive 
ggplotly(events_timeline, tooltip = c("text"))

5.3.8 Informal relationships between individuals through home visits

We want to understand the informal relationships between employees through the homes they visit. We isolated the homes from the event timeline and picked out the homes which were visited by people besides their usual occupants.

hide
# Change data into long form for line chart
events_matched_long_home <- events_matched %>% 
  filter(str_detect(location,"Home")) %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_home <- ggplot(events_matched_long_home, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_home, tooltip = c("text"))
hide
# Change data into long form for line chart
events_matched_long_home2 <- events_matched %>% 
  filter(location == "Home 2") %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_home2 <- ggplot(events_matched_long_home2, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs - Friday Dinner Party at Lars", y = "Car ID", x = "Date", color = "Locations") 


# Make gantt chart interactive
ggplotly(events_timeline_home2, tooltip = c("text"))
hide
# Change data into long form for line chart
events_matched_long_exechome <- events_matched %>% 
  filter(location %in% c("Home 10", "Home 32", "Home 35", "Home 4")) %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_exechome <- ggplot(events_matched_long_exechome, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs - Midnight Surveillance of the Executives' Homes", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_exechome, tooltip = c("text"))
hide
# Change data into long form for line chart
events_matched_long_friendhome <- events_matched %>% 
  filter(location %in% c("Home 14/18", "Home 22/23/30")) %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_friendhome <- ggplot(events_matched_long_friendhome, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs - Dating or Close Friends?", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_friendhome, tooltip = c("text"))

5.3.9 Informal relationships between individuals through leisure

We want to understand the informal relationships between employees through the leisure locations they visit. We isolated the Leisure events from timeline.

hide
# Change data into long form for line chart
events_matched_long_leisure <- events_matched %>% 
  filter(category == "Leisure") %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_leisure <- ggplot(events_matched_long_leisure, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs", y = "Car ID", x = "Date", color = "Locations") 
  
  # Can add highlight when do not need tooltip
  # gghighlight(location == "Home 2") 

# Make gantt chart interactive
ggplotly(events_timeline_leisure, tooltip = c("text"))
hide
# Change data into long form for line chart
events_matched_long_hotel <- events_matched %>% 
  filter(location == "Chostus Hotel") %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_hotel <- ggplot(events_matched_long_hotel, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs - Afternoon Rendezvous at Chostus Hotel", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_hotel, tooltip = c("text"))
hide
# Change data into long form for line chart
events_matched_long_golf <- events_matched %>% 
  filter(location == "Desafio Golf Course") %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_golf <- ggplot(events_matched_long_golf, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs - CEO Joins Executives' Weekly Sunday Golf", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_golf, tooltip = c("text"))
hide
# Change data into long form for line chart
events_matched_long_park <- events_matched %>% 
  filter(location %in% c("Ahaggo Museum","Kronos Capitol")) %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_park <- ggplot(events_matched_long_park, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs - Hang out At Museum and Park", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_park, tooltip = c("text"))

5.3.10 Informal relationships between individuals through meals

hide
# Change data into long form for line chart
events_matched_long_food <- events_matched %>% 
  filter(category == "Food") %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_food <- ggplot(events_matched_long_food, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_food, tooltip = c("text"))
hide
# Change data into long form for line chart
events_matched_long_BA <- events_matched %>% 
  filter(location == "Brewed Awakenings") %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_BA <- ggplot(events_matched_long_BA, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs - Surveillance at Brewed Awakenings", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_BA, tooltip = c("text"))
hide
# Change data into long form for line chart
events_matched_long_GG <- events_matched %>% 
  filter(location == "Guy's Gyros") %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_GG <- ggplot(events_matched_long_GG, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs - Surveillance at Guy's Gyros", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_GG, tooltip = c("text"))
hide
# Change data into long form for line chart
events_matched_long_UK <- events_matched %>% 
  filter(category == "Unknown") %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp") 

# Plot timeline data as gantt chart
events_timeline_UK <- ggplot(events_matched_long_UK, 
                          aes(text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category))) +
  # Add line range
  geom_line(mapping=aes(y=id,x=timestamp,color=location,group=rownum),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Locations by Car IDs - Unknown Locations", y = "Car ID", x = "Date", color = "Locations") 

# Make gantt chart interactive
ggplotly(events_timeline_UK, tooltip = c("text"))

5.3.11 Match Car IDs with credit card and loyalty card IDs

After matching the credit card transactions with GPS events, we discover that 50% threshold is able to pick up most of the matches, except Car ID 9 and 28. Using this rule of thumb, we will pick all credit card and Car ID matches that are above 50% match and the top percentage match for Car ID 28 and 9.

hide
# Fuzzy match all events with credit card and loyalty card transactions to find unique pairs of car ID and credit card & loyalty card IDs
trans_freq <- trans_final %>% group_by(last4ccnum) %>%
  summarise(transaction_freq = n())

spend_match_cc <- trans_final %>%
  fuzzy_left_join(events_matched, 
                  by = c("timestamp" = "event_start_time", 
                        "timestamp" = "event_stop_time", "location" = "location"),
                  match_fun = list(`>`,`<`, `==`)) %>% 
  dplyr::select(name, department, title, id, last4ccnum, loyaltynum) %>%
  group_by(name, department, title, id, last4ccnum, loyaltynum) %>%
  summarise(n = n()) %>% 
  ungroup() %>% 
  left_join(trans_freq, by = "last4ccnum") %>% 
  mutate(pct_match = n/transaction_freq*100)

# Observe the matches that are more than 30%
spend_match_cc_2 <- spend_match_cc %>% 
  filter(pct_match >= 30) 

# Plot chart to visualize 
spend_match_cc_conf <- ggplot(spend_match_cc_2, aes(x=pct_match, y = id, color = last4ccnum, text = paste('Match:', pct_match, 
                                           '<br>id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Credit Card:', last4ccnum))) +
  geom_vline(xintercept=50, linetype="dotted", color = "black") +
  geom_point() +
  theme_minimal() +
  # Modify scale
  scale_y_discrete(limits = rev) +
  # Add label
  labs(title = "All Car IDs, except 9 and 28, have >50% Match \nBetween Credit Card Transactions and GPS Events", y = "Car ID", x = "Percentage Match (%)", color = "Credit Card Number") 

ggplotly(spend_match_cc_conf, tooltip = c("text"))
hide
# Select 9 & 28 top match
spend_match_cc_3 <- spend_match_cc_2 %>% 
  filter(id %in% c("9","28")) %>%
  group_by(id) %>%
  arrange(desc(pct_match)) %>% 
  slice_head() %>%
  ungroup()

# Select matches above 50% match and remove duplicates
car_spend_match <- spend_match_cc_2 %>% 
  filter(pct_match >= 50) %>%
  rbind(spend_match_cc_3) %>%
  group_by(last4ccnum) %>%
  arrange(desc(pct_match)) %>% 
  slice_head() %>%
  ungroup()

# Ownership of Credit Card and Car ID
DT::datatable(car_spend_match, filter = 'top', width = '100%', options = list(scrollX = TRUE), rownames = FALSE)
hide
unmatched_cc <- cc %>% 
  group_by(last4ccnum) %>%
  summarise(n = n()) %>%
  left_join(car_spend_match, by = "last4ccnum") %>% 
  filter(is.na(id)) %>%
  dplyr::select(last4ccnum)

DT::datatable(unmatched_cc, filter = 'top', width = '100%', options = list(scrollX = TRUE), rownames = FALSE)

5.3.12 Plot timeline of events with spending transactions

From the interactive timeline charts of the GPS locations and spend locations, we observe that employees usually make a corresponding spend on their credit cards when visiting a location with some exceptions.

hide
trans_final_id <- car_spend_match %>%
  left_join(trans_final, by = c("last4ccnum")) %>%
  filter(!is.na(id))

# Plot timeline data as gantt chart
g1 <- ggplot() +
  # Add line 
  geom_line(events_matched_long, mapping=aes(y=id,x=timestamp,color=location,group=rownum, 
                                             text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) 
    

# Plot timeline data as gantt chart
g2 <- ggplot() +
  # Add point 
  geom_point(trans_final_id,mapping=aes(y=id, x=timestamp, color=location,
                                        text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=1) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "GPS and Spend Transactions by Locations and ID", y = "Car ID", x = "Date", color = "Locations") 

  
# Make gantt chart interactive
subplot(ggplotly(g1, tooltip = c("text")), ggplotly(g2, tooltip = c("text")))

5.3.13 Plot timeline of events that do not match spending transactions

Find transactions that do not match with GPS locations and GPS locations that did not register any spend

hide
# Events with timestamp
trans_final_id <- car_spend_match %>%
  left_join(trans_final, by = c("last4ccnum")) %>%
  filter(!is.na(id)) %>%
  dplyr::select(timestamp, datestamp, datestamp_loyalty, location, id, last4ccnum, loyaltynum.y, price)

# GPS events at POIs
POI_events_only <- events_matched %>% 
  filter(!category %in% c("Home","Unknown")) %>% 
  filter(!location %in% c("GASTech","Kronos Capitol"))

# Fuzzy match all events with credit card and loyalty card transactions to find unique pairs of car ID and credit card & loyalty card IDs
spend_match <- trans_final_id %>%
  fuzzy_full_join(POI_events_only, 
                  by = c("timestamp" = "event_start_time", 
                        "timestamp" = "event_stop_time", 
                        "location" = "location",
                        "id" = "id"),
                  match_fun = list(`>`,`<`,`==`,`==`))

spend_mismatch <- spend_match %>%
  filter(is.na(timestamp)|is.na(event_start_time))

# Change data into long form for line chart
gps_mismatched_long <- spend_mismatch %>% 
  filter(!is.na(event_start_time)) %>% 
  rename(id = id.y, location = location.y) %>% 
  dplyr::select(event_date, event_start_time, event_stop_time, diff, lat, long, geohash, id, name, department, title, location, category) %>%
  mutate(rownum = row_number()) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp")

spend_mismatch_prep <- spend_mismatch %>%
  filter(!is.na(timestamp)) %>%
  rename(id = id.x, location = location.x) %>%
  dplyr::select(timestamp, datestamp, datestamp_loyalty, location, id, last4ccnum, loyaltynum.y, price)
                  
# Plot timeline data as gantt chart
events_timeline_mismatch <- ggplot() +
  # Add line & point
  geom_line(gps_mismatched_long, mapping=aes(y=id,x=timestamp,color=location,group=rownum, 
                                             text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  geom_point(spend_mismatch_prep, mapping=aes(y=id, x=timestamp, fill=location,
                                        text = paste('id:', id,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=4, alpha=0.3) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Mismatch Locations", y = "Car ID", x = "Date", color = "Locations", fill = "Locations") 
  
# Make gantt chart interactive
ggplotly(events_timeline_mismatch, tooltip = c("text"))                  

Kronos Mart credit card postings were posted approximately 12 hours after the purchase was made.

hide
# Change data into long form for line chart
trans_final_KM <- car_spend_match %>%
  left_join(trans_final, by = c("last4ccnum")) %>%
  filter(location == "Kronos Mart")

events_matched_long_KM <- events_matched_long %>%
  filter(location == "Kronos Mart")

# Plot timeline data as gantt chart
events_timeline_spend_KM <- ggplot() +
  # Add line & point
  geom_line(events_matched_long_KM, mapping=aes(y=id,x=timestamp,color=location,group=rownum, 
                                             text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  geom_point(trans_final_KM,mapping=aes(y=id, x=timestamp, fill=location,
                                        text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=4, alpha=0.3) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Kronos Mart - Visits and Spend Transactions Mismatched", y = "Car ID", x = "Date", 
       color = "GPS Locations - Line", fill = "Spend Locations - Circle") 

# Make gantt chart interactive
ggplotly(events_timeline_spend_KM, tooltip = c("text"))

Nils (Car ID 1) credit card (9551) was used at Frydos Autosupply n More on 13 Jan at but he was not present at the location based on his GPS. Nils was actually at Ouzeri Elian. Instead, Minke (Car ID 24) was present at the shop. We can infer that Minke had stolen Nils’ credit card for fraudulent use.

hide
# Change data into long form for line chart
trans_final_FA <- car_spend_match %>%
  left_join(trans_final, by = c("last4ccnum")) %>%
  filter(location == "Frydos Autosupply n' More")

events_matched_long_FA <- events_matched_long %>%
  filter(location == "Frydos Autosupply n' More")

# Plot timeline data as gantt chart
events_timeline_spend_FA <- ggplot() +
  # Add line & point
  geom_line(events_matched_long_FA, mapping=aes(y=id,x=timestamp,color=location,group=rownum, 
                                             text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  geom_point(trans_final_FA,mapping=aes(y=id, x=timestamp, fill=location,
                                        text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=4, alpha=0.3) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Frydos Autosupply n' More - \nSpend on 13 Jan by Nils' Credit Card Not Supported by GPS", y = "Car ID", x = "Date", 
       color = "GPS Locations - Line", fill = "Spend Locations - Circle") 
  
# Make gantt chart interactive
ggplotly(events_timeline_spend_FA, tooltip = c("text"))

Matching the GPS and spend transactions for the Food establishments with fixed credit card postings shows that the credit card postings are usually done on the same day of the purchase.

hide
# Change data into long form for line chart
trans_final_cafe <- car_spend_match %>%
  left_join(trans_final, by = c("last4ccnum")) %>%
  filter(location %in% c("Bean There Done That", "Brewed Awakenings","Jack's Magical Beans"))

events_matched_long_cafe <- events_matched_long %>%
  filter(location %in% c("Bean There Done That", "Brewed Awakenings","Jack's Magical Beans"))

# Plot timeline data as gantt chart
events_timeline_spend_cafe <- ggplot() +
  # Add line & point
  geom_line(events_matched_long_cafe, mapping=aes(y=id,x=timestamp,color=location,group=rownum, 
                                             text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  geom_point(trans_final_cafe,mapping=aes(y=id, x=timestamp, fill=location,
                                        text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=4, alpha=0.3) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Cafes with Fixed Posting - \nVisits and Spend Transactions Mismatched", y = "Car ID", x = "Date", 
       color = "GPS Locations - Line", fill = "Spend Locations - Circle") 
  
# Make gantt chart interactive
ggplotly(events_timeline_spend_cafe, tooltip = c("text"))

Kanon (22) and Borrasca (28) seem to be dating based on their frequent co-location.

hide
# Change data into long form for line chart
events_matched_long_2 <- events_matched %>% 
  mutate(event_date_end = as_date(event_stop_time)) %>%
  mutate(flag = ifelse(event_date_end > event_date,1,0)) %>%
  filter(flag == 0) %>%
  filter(id %in% c("22","28")) %>%
  mutate(event_start_time = format(event_start_time, format = "%H:%M:%S")) %>%
  mutate(event_start_time = as.POSIXct(event_start_time, format = "%H:%M:%S")) %>%
  mutate(event_stop_time = format(event_stop_time, format = "%H:%M:%S")) %>%
  mutate(event_stop_time = as.POSIXct(event_stop_time, format = "%H:%M:%S")) %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp")

trans_final_2 <- car_spend_match %>%
  left_join(trans_final, by = c("last4ccnum")) %>%
  filter(id %in% c("22","28")) %>%
  mutate(timestamp = format(timestamp, format = "%H:%M:%S")) %>%
  mutate(timestamp = as.POSIXct(timestamp, format = "%H:%M:%S"))

# Plot timeline data as gantt chart
events_timeline_2 <- ggplot() +
  # Add line range
  geom_line(events_matched_long_2, mapping=aes(y=event_date,x=timestamp,color=location,group=rownum, text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', format(timestamp, format = "%H:%M"),
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  geom_point(trans_final_2,mapping=aes(y=datestamp, x=timestamp, fill=location,
                                        text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=4, alpha=0.3) +
  
  # Change theme
  theme_minimal() +

  # Modify scale
  scale_y_date(breaks="1 day", labels=label_date_short()) +
  scale_x_datetime(breaks="1 hour", labels=time_format(format = "%H:%M", tz = Sys.timezone(location=TRUE))) + 
  theme(axis.text.x=element_text(angle=45, hjust=1)) +

  facet_wrap(~id, ncol = 2) +

  # Add label
  labs(title = "Locations Visted By Kanon(22) and Borrasca(28) - \nCouple Dating?", y = "Event Start Date", x = "Time", 
       color = "GPS Locations - Line", fill = "Spend Locations - Circle") 

# Make gantt chart interactive
ggplotly(events_timeline_2, tooltip = c("text"))

Bertrand changed credit cards after an erratic night on 11 Jan.

hide
# Change data into long form for line chart
events_matched_long_3 <- events_matched %>% 
  mutate(event_date_end = as_date(event_stop_time)) %>%
  mutate(flag = ifelse(event_date_end > event_date,1,0)) %>%
  filter(flag == 0) %>%
  filter(id %in% c("29")) %>%
  mutate(event_start_time = format(event_start_time, format = "%H:%M:%S")) %>%
  mutate(event_start_time = as.POSIXct(event_start_time, format = "%H:%M:%S")) %>%
  mutate(event_stop_time = format(event_stop_time, format = "%H:%M:%S")) %>%
  mutate(event_stop_time = as.POSIXct(event_stop_time, format = "%H:%M:%S")) %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp")

trans_final_3 <- car_spend_match %>%
  left_join(trans_final, by = c("last4ccnum")) %>%
  filter(id %in% c("29")) %>%
  mutate(timestamp = format(timestamp, format = "%H:%M:%S")) %>%
  mutate(timestamp = as.POSIXct(timestamp, format = "%H:%M:%S"))

# Plot timeline data as gantt chart
events_timeline_3 <- ggplot() +
  # Add line range
  geom_line(events_matched_long_3, mapping=aes(y=event_date,x=timestamp,color=location,group=rownum, text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', format(timestamp, format = "%H:%M"),
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  geom_point(trans_final_3,mapping=aes(y=datestamp, x=timestamp, fill=location,
                                        text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=4, alpha=0.3) +
  
  # Change theme
  theme_minimal() +

  # Modify scale
  scale_y_date(breaks="1 day", labels=label_date_short()) +
  scale_x_datetime(breaks="1 hour", labels=time_format(format = "%H:%M", tz = Sys.timezone(location=TRUE))) + 
  theme(axis.text.x=element_text(angle=45, hjust=1)) +

  # Add label
  labs(title = "Locations Visted By Bertrand (29) - \nChanged Credit Card After Erratic Night", y = "Event Start Date", x = "Time", 
       color = "GPS Locations - Line", fill = "Spend Locations - Circle") 

# Make gantt chart interactive
ggplotly(events_timeline_3, tooltip = c("text"))

Gustav seem to have made purchases at outlets nearby his home as there are unexplained credit card transactions made when he was at home.

hide
# Change data into long form for line chart
events_matched_long_5 <- events_matched %>% 
  mutate(event_date_end = as_date(event_stop_time)) %>%
  mutate(flag = ifelse(event_date_end > event_date,1,0)) %>%
  filter(flag == 0) %>%
  filter(id %in% c("9")) %>%
  mutate(event_start_time = format(event_start_time, format = "%H:%M:%S")) %>%
  mutate(event_start_time = as.POSIXct(event_start_time, format = "%H:%M:%S")) %>%
  mutate(event_stop_time = format(event_stop_time, format = "%H:%M:%S")) %>%
  mutate(event_stop_time = as.POSIXct(event_stop_time, format = "%H:%M:%S")) %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp")

trans_final_5 <- car_spend_match %>%
  left_join(trans_final, by = c("last4ccnum")) %>%
  filter(id %in% c("9")) %>%
  mutate(timestamp = format(timestamp, format = "%H:%M:%S")) %>%
  mutate(timestamp = as.POSIXct(timestamp, format = "%H:%M:%S"))

# Plot timeline data as gantt chart
events_timeline_5 <- ggplot() +
  # Add line range
  geom_line(events_matched_long_5, mapping=aes(y=event_date,x=timestamp,color=location,group=rownum, text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', format(timestamp, format = "%H:%M"),
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  geom_point(trans_final_5,mapping=aes(y=datestamp, x=timestamp, fill=location,
                                        text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=4, alpha=0.3) +
  
  # Change theme
  theme_minimal() +

  # Modify scale
  scale_y_date(breaks="1 day", labels=label_date_short()) +
  scale_x_datetime(breaks="1 hour", labels=time_format(format = "%H:%M", tz = Sys.timezone(location=TRUE))) + 
  theme(axis.text.x=element_text(angle=45, hjust=1)) +

  # Add label
  labs(title = "Locations Visted By Gustav (9) - \nPurchase Made on Foot?", y = "Event Start Date", x = "Time", 
       color = "GPS Locations - Line", fill = "Spend Locations - Circle") 

# Make gantt chart interactive
ggplotly(events_timeline_5, tooltip = c("text"))

Minke stole Nil’s credit card.

hide
# Change data into long form for line chart
events_matched_long_4 <- events_matched %>% 
  mutate(event_date_end = as_date(event_stop_time)) %>%
  mutate(flag = ifelse(event_date_end > event_date,1,0)) %>%
  filter(flag == 0) %>%
  filter(id %in% c("1","24")) %>%
  mutate(event_start_time = format(event_start_time, format = "%H:%M:%S")) %>%
  mutate(event_start_time = as.POSIXct(event_start_time, format = "%H:%M:%S")) %>%
  mutate(event_stop_time = format(event_stop_time, format = "%H:%M:%S")) %>%
  mutate(event_stop_time = as.POSIXct(event_stop_time, format = "%H:%M:%S")) %>%
  mutate(rownum = row_number()) %>%
  dplyr::select(!event_stop) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp")

trans_final_4 <- car_spend_match %>%
  left_join(trans_final, by = c("last4ccnum")) %>%
  filter(id %in% c("1","24")) %>%
  mutate(timestamp = format(timestamp, format = "%H:%M:%S")) %>%
  mutate(timestamp = as.POSIXct(timestamp, format = "%H:%M:%S"))

# Plot timeline data as gantt chart
events_timeline_4 <- ggplot() +
  # Add line range
  geom_line(events_matched_long_4, mapping=aes(y=event_date,x=timestamp,color=location,group=rownum, text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Tite:', title,
                                           '<br>Timestamp:', format(timestamp, format = "%H:%M"),
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  geom_point(trans_final_4,mapping=aes(y=datestamp, x=timestamp, fill=location,
                                        text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=4, alpha=0.3) +
  
  # Change theme
  theme_minimal() +

  # Modify scale
  scale_y_date(breaks="1 day", labels=label_date_short()) +
  scale_x_datetime(breaks="1 hour", labels=time_format(format = "%H:%M", tz = Sys.timezone(location=TRUE))) + 
  theme(axis.text.x=element_text(angle=45, hjust=1)) +

  facet_wrap(~id, ncol = 2) +

  # Add label
  labs(title = "Locations Visted By Nils (1) and  Minke (24) - \nMinke Stole Nil's Credit Card?", y = "Event Start Date", x = "Time", color = "GPS Locations - Line", fill = "Spend Locations - Circle") 

# Make gantt chart interactive
ggplotly(events_timeline_4, tooltip = c("text"))

Find transactions that do not match with GPS locations and GPS locations that did not register any spend

hide
# GPS events at POIs
POI_events_only_2 <- events_matched %>% 
  filter(!category %in% c("Home","Unknown")) %>% 
  filter(!location %in% c("GASTech","Kronos Capitol","Kronos Mart","Bean There Done That","Brewed Awakenings","Jack's Magical Beans")) %>%
  filter(!id %in% c("22","28","29"))

# Fuzzy match all events with credit card and loyalty card transactions to find unique pairs of car ID and credit card & loyalty card IDs
spend_match_4 <- trans_final_id %>%
  filter(!location %in% c("GASTech","Kronos Capitol","Kronos Mart","Bean There Done That","Brewed Awakenings","Jack's Magical Beans")) %>%
  filter(!id %in% c("22","28","29")) %>%
  fuzzy_full_join(POI_events_only_2, 
                  by = c("timestamp" = "event_start_time", 
                        "timestamp" = "event_stop_time", 
                        "location" = "location",
                        "id" = "id"),
                  match_fun = list(`>`,`<`,`==`,`==`))

spend_mismatch_2 <- spend_match_4 %>%
  filter(is.na(timestamp)|is.na(event_start_time))

# Change data into long form for line chart
gps_mismatched_long_2 <- spend_mismatch_2 %>% 
  filter(!is.na(event_start_time)) %>% 
  rename(id = id.y, location = location.y) %>% 
  dplyr::select(event_date, event_start_time, event_stop_time, diff, lat, long, geohash, id, name, department, title, location, category) %>%
  mutate(rownum = row_number()) %>%
  pivot_longer(cols = starts_with("event_st"), names_to = "event_start_stop", values_to = "timestamp")

spend_mismatch_prep_2 <- spend_mismatch_2 %>%
  filter(!is.na(timestamp)) %>%
  rename(id = id.x, location = location.x) %>%
  dplyr::select(timestamp, datestamp, datestamp_loyalty, location, id, last4ccnum, loyaltynum.y, price)
                  
# Plot timeline data as gantt chart
events_timeline_mismatch_2 <- ggplot() +
  # Add line & point
  geom_line(gps_mismatched_long_2, mapping=aes(y=id,x=timestamp,color=location,group=rownum, 
                                             text = paste('id:', id,
                                           '<br>Name:', name,
                                           '<br>Department:', department,
                                           '<br>Title:', title,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location,
                                           '<br>Category: ', category)),size=1) +
  geom_point(spend_mismatch_prep_2, mapping=aes(y=id, x=timestamp, fill=location,
                                        text = paste('id:', id,
                                           '<br>Last 4 Credit Card Num:', last4ccnum,
                                           '<br>Loyalty Num:', loyaltynum.y,
                                           '<br>Price: $', price,
                                           '<br>Timestamp:', timestamp,
                                           '<br>Location: ', location)),size=4, alpha=0.3) +
  
  # Change theme
  theme_minimal() + 
  
  # Modify scale
  scale_y_discrete(limits = rev) +
  scale_x_datetime(date_breaks="1 day", labels =label_date_short()) + 
  theme(axis.text.x=element_text(hjust=1)) +
    
  # Add label
  labs(title = "Mismatch Locations", y = "Car ID", x = "Date", color = "GPS Locations - Lines", fill = "Credit Card Locations - Circle") 
  
# Make gantt chart interactive
ggplotly(events_timeline_mismatch_2, tooltip = c("text"))                  

5.4.1 Map view of suspicious locations

Besides looking at the event timeline, we want to observe if there are patterns to the routes taken to the suspicious locations by the suspects.

Routes to unknown locations

hide
# Prepare paths for ID 13
gps_path_1 <- gps_name_diff %>% 
  mutate(day = as.character(get_day(Timestamp))) %>%
  mutate(hour = get_hour(Timestamp)) %>%
  filter(id %in% c("13")) %>%
  filter(hour <= 13, hour >= 11) %>%
  filter(day %in% c("7","9","10","13","15","16","17","18"))

# Map all locations
map <- ggplot(gps_path_1, aes(long, lat)) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add path points
  geom_point(mapping=aes(fill=day), size = 0.3) +
  geom_point(location_labels_all, mapping=aes(long, lat, color=category, text = paste("geohash:", geohash)), alpha = 0.8, size = 3) +
  
  # Add label
  labs(title = "Visits to Suspicious Locations by Inga - Car ID 13", color = "Location", x = "Longitude", y = "Latitude", fill = "Day of the Month")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)

layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))
hide
# Prepare paths for ID 15
gps_path_2 <- gps_name_diff %>% 
  mutate(day = as.character(get_day(Timestamp))) %>%
  mutate(hour = get_hour(Timestamp)) %>%
  filter(id %in% c("15")) %>%
  filter(hour <= 13, hour >= 11) %>%
  filter(day %in% c("7","8","9","11","13","14","15","17"))

# Map all locations
map <- ggplot(gps_path_2, aes(long, lat)) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add path points
  geom_point(mapping=aes(fill=day), size = 0.3) +
  geom_point(location_labels_all, mapping=aes(long, lat, color=category, text = paste("geohash:", geohash)), alpha = 0.8, size = 3) +
  
  # Add label
  labs(title = "Visits to Suspicious Locations by Loreto - Car ID 15", color = "Location", x = "Longitude", y = "Latitude", fill = "Day of the Month")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)

layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))
hide
# Prepare paths for ID 21 Hennie
gps_path_3 <- gps_name_diff %>% 
  mutate(day = as.character(get_day(Timestamp))) %>%
  mutate(hour = get_hour(Timestamp)) %>%
  filter(id %in% c("21")) %>%
  filter(hour <= 13, hour >= 11) %>%
  filter(day %in% c("8","9","10","11","13","16","15","17"))

# Map all locations
map <- ggplot(gps_path_3, aes(long, lat)) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add path points
  geom_point(mapping=aes(fill=day), size = 0.3) +
  geom_point(location_labels_all, mapping=aes(long, lat, color=category, text = paste("geohash:", geohash)), alpha = 0.8, size = 3) +
  
  # Add label
  labs(title = "Visits to Suspicious Locations by Hennie - Car ID 21", color = "Location", x = "Longitude", y = "Latitude", fill = "Day of the Month")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)

layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))
hide
# Prepare paths for ID 24 Minke
gps_path_4 <- gps_name_diff %>% 
  mutate(day = as.character(get_day(Timestamp))) %>%
  mutate(hour = get_hour(Timestamp)) %>%
  filter(id %in% c("24")) %>%
  filter(hour <= 13, hour >= 11) %>%
  filter(day %in% c("7","8","9","10","14","15","16"))

# Map all locations
map <- ggplot(gps_path_4, aes(long, lat)) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add path points
  geom_point(mapping=aes(fill=day), size = 0.3) +
  geom_point(location_labels_all, mapping=aes(long, lat, color=category, text = paste("geohash:", geohash)), alpha = 0.8, size = 3) +
  
  # Add label
  labs(title = "Visits to Suspicious Locations by Minke - Car ID 24", color = "Location", x = "Longitude", y = "Latitude", fill = "Day of the Month")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)

layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))

Routes to surveillance

hide
# Prepare paths for home surveillance
gps_path_5 <- gps_name_diff %>% 
  mutate(day = as.character(get_day(Timestamp))) %>%
  mutate(hour = get_hour(Timestamp)) %>%
  filter(id %in% c("24","15","16","21")) %>%
  filter(hour <= 7, hour >= 0) %>%
  filter(day %in% c("7","9","11","14"))

# Map all locations
map <- ggplot(gps_path_5, aes(long, lat)) + 
    
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey")) +

  # Add path points
  geom_point(mapping=aes(fill=day), size = 0.3) +
  geom_point(location_labels_all %>% filter(category == "Home"), 
             mapping=aes(long, lat, color=location, text = paste("geohash:", geohash)), alpha = 0.8, size = 3) +
  
  # Add label
  labs(title = "Visits to Executives' Homes", color = "Location", x = "Longitude", y = "Latitude", fill = "Day of the Month")

# Use ggplotly to make it interactive so we can pick the right geohash visually
pp <- ggplotly(map)

layout(pp, images = list(
  list(
    source = "https://raw.githubusercontent.com/kpokp/ISSS608_blog/main/_posts/2021-06-10-individual-assignment/images/MC2-tourist.jpg",
    opacity = 0.3,
    layer = "below",
    xref = "x",
    yref = "y",
    sizing="stretch",
    x= 24.8244,
    y= 36.0952,
    sizex= 0.0852,
    sizey= 0.05
  )
))

Please click here for Part 3: Insights and Conclusion